This repository has been archived by the owner on Nov 17, 2023. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 6.8k
Refactor LibraryInitializer so it's thread safe. Fixes random sporadical concurrency crashes. #15762
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
larroy
changed the title
Refactor LibraryInitializer so it's thread safe. Fixes random sporadical concurrency crashes.
[WIP] Refactor LibraryInitializer so it's thread safe. Fixes random sporadical concurrency crashes.
Aug 6, 2019
@mxnet-label-bot add [pr-work-in-progress] |
zhreshold
approved these changes
Aug 6, 2019
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
7 tasks
samskalicky
approved these changes
Aug 7, 2019
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
larroy
changed the title
[WIP] Refactor LibraryInitializer so it's thread safe. Fixes random sporadical concurrency crashes.
Refactor LibraryInitializer so it's thread safe. Fixes random sporadical concurrency crashes.
Aug 8, 2019
mseth10
reviewed
Aug 8, 2019
mseth10
reviewed
Aug 8, 2019
mseth10
reviewed
Aug 8, 2019
@mxnet-label-bot remove [pr-work-in-progress] |
@mxnet-label-bot add [pr-awaiting-review] |
mseth10
approved these changes
Aug 9, 2019
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
anirudhacharya
pushed a commit
to anirudhacharya/mxnet
that referenced
this pull request
Aug 20, 2019
…cal concurrency crashes. (apache#15762) * Refactor LibraryInitializer so it's thread safe. Fixes apache#13438 Fixes apache#14979 * Refactor around lib loading * Fix lint * CR * Add option to choose between OMP implementations * Fix bug * Fix from CR
larroy
added a commit
to larroy/mxnet
that referenced
this pull request
Dec 5, 2019
… sporadical concurrency crashes. (apache#15762)" This reverts commit bfd3bb8.
larroy
added a commit
to larroy/mxnet
that referenced
this pull request
Dec 7, 2019
This could happen if it was set in the environment. As we are setting engine::OpenMP::Get()->set_enabled(false) in initialize.cc in the child after forking, the behaviour goes back to what it was before apache#15762 was introduced. Regions using omp get the threads count from GetRecommendedOMPThreadCount, so if omp is disabled they will get 1 thread and run serially
larroy
added a commit
to larroy/mxnet
that referenced
this pull request
Dec 7, 2019
This could happen if it was set in the environment. As we are setting engine::OpenMP::Get()->set_enabled(false) in initialize.cc in the child after forking, the behaviour goes back to what it was before apache#15762 was introduced. Regions using omp get the threads count from GetRecommendedOMPThreadCount, so if omp is disabled they will get 1 thread and run serially
anirudh2290
pushed a commit
that referenced
this pull request
Dec 11, 2019
* Prevent after-fork number of OMP threads being bigger than 1. This could happen if it was set in the environment. As we are setting engine::OpenMP::Get()->set_enabled(false) in initialize.cc in the child after forking, the behaviour goes back to what it was before #15762 was introduced. Regions using omp get the threads count from GetRecommendedOMPThreadCount, so if omp is disabled they will get 1 thread and run serially * add C++ unit test * Add comment
ptrendx
pushed a commit
that referenced
this pull request
Dec 12, 2019
* Fix test_gluon.py:test_sync_batchnorm when number of GPUS > 4 * Prevent after-fork number of OMP threads being bigger than 1. This could happen if it was set in the environment. As we are setting engine::OpenMP::Get()->set_enabled(false) in initialize.cc in the child after forking, the behaviour goes back to what it was before #15762 was introduced. Regions using omp get the threads count from GetRecommendedOMPThreadCount, so if omp is disabled they will get 1 thread and run serially * add C++ unit test * Add comment
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
This PR improves initialization of the Library which uses unsafe passing of environment variables to child processes through pthread_atfork handlers. Calling setenv concurrently with getenv leads to random crashes which have been reported in several ocassions but so far haven't been fixed.
This patch avoids using getenv and setenv and it should address those random crashes while improving handling of the library intialization in the LibraryInitializer object.
Fixes #13438
Fixes #14979
Fixes minor issues with #15755
Aggregated dynamic library code in LibraryInitializer.
Verified Dtor is called on exit.
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes